Abstract

In this paper, we build an organization of high-dimensional datasets that cannot 
be cleanly embedded into a low-dimensional representation due to missing entries and 
a subset of the features being irrelevant to modeling functions of interest. Our algorithm
begins by defining coarse neighborhoods of the points and defining an expected
empirical function value on these neighborhoods. We then generate new non-linear
features with deep net representations tuned to model the approximate function, and
re-organize the geometry of the points with respect to the new representation. Finally,
the points are locally z-scored to create an intrinsic geometric organization which is
independent of the parameters of the deep net, a geometry designed to assure smoothness
with respect to the empirical function. We examine this approach on data from the 
Center for Medicare and Medicaid Services Hospital Quality Initiative, and generate 
an intrinsic low-dimensional organization of the hospitals that is smooth with respect 
to an expert driven function of quality. 


1 Introduction 

Finding low dimensional embeddings of high dimensional data is vital in understanding the
organization of unsupervised data sets. However, most embedding techniques rely on the
assumption that the data set is locally Euclidean. In the case that features carry
implicit weighting, some features are possibly irrelevant, and most points are missing some
subset of the features, Euclidean neighborhoods can become spurious and lead to poor low
dimensional representations.

In this paper, we develop the method of expert driven functional discovery to deal with the
issue of spurious neighborhoods in data sets with high dimensional contrasting features.
This allows small amounts of input and ranking from experts to propagate through the
data set in a non linear, smooth fashion. We then build a distance metric based on these
opinions that learns the invariant and irrelevant features from this expert driven function.
Finally, we locally normalize this distance metric to generate a global embedding of the
data into a homogeneous space.

An example to keep in mind throughout the paper, an idea we expand upon in Section 4,
is a data set containing publicly-reported measurements of hospital quality. The Center for
Medicare and Medicaid Services Hospital Quality Initiative reports approximately 100
different measures describing various components of the quality of care provided at
Medicare-Certified hospitals across the United States. These features measure hospital
processes, patient experience, safety, rates of surgical complications, and rates of various
types of readmission and mortality. There are more than 5,000 hospitals that reported at
least one measure during 2014, but only 1,614 hospitals with 90% of measures reported. The
measures are computed quarterly, and are publicly available through the Hospital Compare
website. The high dimensional nature of these varied measures makes comprehensive
inferences about hospital quality impossible without summarizing statistics.

Our goal is more than just learning a ranking function $f$ on the set of hospitals $X$. We
are trying to characterize the cohort of hospitals and organize the geometry of the data
set, and learn a multi-dimensional embedding of the data for which the ranking function is
smooth. This gives an understanding of the data that doesn’t exist with a one dimensional
ranking function. Specifically, we are looking for meta-features of the data in order to build
a metric $\rho : X \times X \to \mathbb{R}^+$ that induces a small Lipschitz constant on the
function $f$, as well as on features measured by CMS.

An example of this organization is shown in Figure 1. The organization is generated via our
algorithm of expert driven functional discovery, the details of which are found in Sections
2 and 3. The colors in each image correspond to three notable CMS features: risk
standardized 30 day hospital-wide readmission, patient overall rating of the hospital, and risk
standardized 30 day mortality for heart failure. This organization successfully separates
hospitals into regimes for which each feature is relatively smooth.



Figure 1: Organization colored by: (left) risk standardized 30 day hospital wide readmission,
(center) percent patients rating overall hospital 9 or 10 out of 10, (right) risk standardized
30 day mortality for heart failure. Embedding generated via bigeometric organization of
deep nets. Red is good performance, blue is bad.

Figure 2: (left) unnormalized embedding, (right) z-scored embedding.


Our organization is accomplished via a two step processing of the data. First, we look to
characterize the topology of the data set by creating a stable representation of $X$ such that
neighborhoods can be differentiated at varying scales, and neighborhoods are preserved
across varying parameters of the algorithm. This is done in Sections 2 and 3 using a spin
cycling of deep nets trained on some expert driven function. Second, we build a metric
on that topology by taking a local Mahalanobis distance on the stable neighborhoods (i.e.
local z-scoring) [11], [16]. This is done in Section 3.3 and guarantees that the induced metric
is homogeneous. This means that, if $U_x$ denotes the neighborhood of a point $x$, then
$\rho(x, y)$ for $x, y \in U_x$ measures the same notion of distance as $\rho(x', y')$ for
$x', y' \in U_{x'}$.


Figure 2 demonstrates the issue we refer to here. The left image shows an organization of the
hospitals in which neighborhoods have not been normalized. The red points refer to “good”
hospitals, and the blue refer to “bad” hospitals (in a naive sense). The organization on the
left gives no notion of the spread of the points and leaves certain questions unanswerable
(e.g. are “good” hospitals as diverse as “bad” hospitals). However, if we locally z-score the
regions of the hospitals, as in the right figure, the global notion of distance is normalized
and it becomes clear that “good” hospitals share many more similarities amongst themselves
than “bad” hospitals do among themselves.


Also, by taking a local z-scoring of the features, we generate an organization that is
dependent only on the neighborhoods $U_x$, rather than being dependent on the specific
representations used. Figure 3 shows the organizations of the hospitals generated by algorithms
with two very different parameter sets which, after z-scoring the neighborhoods, generate similar
embeddings. More details about this embedding can be found in Section 4.3.3.

Figure 3: Two sets of organizations with similar neighborhood structure. The representations
used to generate these embeddings are fundamentally different, with the right figure
using 5x as many features as the left, and with features being generated with vastly different
algorithmic parameters.

However, discovering the topology of the hospitals is non-trivial. The features may have
significant disagreement, and not be strongly correlated across the population. To examine
these relationships, one can consider linear correlations via principal component analysis.
The eigenvalues of the correlation matrix do not show the characteristic drop off shown in
linear low dimensional data sets. In fact, 76 of the 86 eigenvalues are above 1% of the size of
the largest eigenvalue. Previous medical literature has also detailed the fact that many of
the features don’t always correlate.

For this reason, there does not exist an organization for which all features are smooth and
monotonically increasing. This is why the meta-features, and organization, must be driven
by minimal external expert opinion. This observation makes the goal of our approach three
fold: develop an organization of the data that is smooth with respect to as many features
as possible, build a ranking function $f$ that agrees with this organization, and minimize the
amount of external input necessary to drive the system.


Our expert driven functional discovery algorithm blends the data set organization of
diffusion maps and coupled partition trees with the rich set of non-linear features generated by
deep learning and stacked neural nets. We build an initial organization of the data via
coupled partition trees [2], and use this partitioning to generate pseudopoints that accurately
represent the span of the data at a coarse scale. This step is explained in Section 2. This
organization is analyzed and used as an input to a series of stacked neural nets, which learn
invariant representations of the data and separate disparate clusters of points. This step is
explained in Section 3. See [2] for a review of stacked neural nets and deep learning.


It is important to note that our use of stacked neural nets is different from traditional
deep learning applications. We discuss these differences in Section 3.5. The purpose of
using deep learning and organizing of the generated representations is to create a notion
of fine neighborhoods between points; neighborhoods where the number of neighbors scales
smoothly with the distance metric.

We then examine and validate our algorithm on the CMS Quality Initiative features in
Section 4.


2 Information Organization and Expert Driven Functional 
Discovery 

2.1 Training on Data with Full Knowledge 

Let the data matrix be

\[ M = [v_1, \ldots, v_N], \tag{1} \]

where $v_i \in \mathbb{R}^m$ is a vector of observations describing the $i$th data point. Each
$v_i$ is allowed to have arbitrarily many missing entries. Define
$\mathrm{supp}(v_i) = \{k \in \{1, \ldots, m\} : v_{i,k} \text{ is observed}\}$.

Due to the missing entries, calculating an affinity between every two points $v_i$ and $v_j$ is
not necessarily possible, given that the intersection of the supports of their known values
may be small or even disjoint. For this reason, we begin by restricting ourselves to points
$v_i$ that have at most $\eta$ missing entries. We shall begin by organizing the set of points
$\Omega = [v_{i_1}, \ldots, v_{i_n}]$.

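To gain an initial understanding of the geometry of $\Omega$, we consider the cosine affinity
matrix $A$ where

\[ A_{j,k} = \frac{\langle v_{i_j}, v_{i_k} \rangle}{\|v_{i_j}\| \, \|v_{i_k}\|}, \tag{2} \]

where the inner product is calculated only on the entries in
$\mathrm{supp}(v_{i_j}) \cap \mathrm{supp}(v_{i_k})$. By definition of $\Omega$, this set
contains at least $m - 2\eta$ known values.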
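
The restriction to the shared support can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: encoding missing entries as `None`, restricting the norms (as well as the inner product) to the shared support, and returning zero when the supports are disjoint are all assumptions of the sketch.

```python
import math

def cosine_affinity(u, v):
    """Cosine affinity computed only on coordinates observed in both vectors.

    Missing entries are encoded as None (an assumption of this sketch).
    """
    shared = [k for k in range(len(u)) if u[k] is not None and v[k] is not None]
    if not shared:
        return 0.0  # disjoint supports: no affinity can be computed
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(u[k] ** 2 for k in shared))
    norm_v = math.sqrt(sum(v[k] ** 2 for k in shared))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two points that share only coordinates 0 and 2.
a = [1.0, None, 2.0, 0.5]
b = [2.0, 3.0, 4.0, None]
print(cosine_affinity(a, b))
```

On the shared coordinates the two example vectors are parallel, so the affinity is 1 up to floating point error.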

The cosine affinity matrix serves as a good starting point for learning the geometry of
$\Omega$. However, we must develop a way to extend any analysis to the full data $M$. For this
reason, we partition both the data points and the observation sets of $\Omega$. This gives us two
advantages: partitioning the data points captures the ways in which different observations
may respond to different subsets of $\Omega$, and partitioning the observation sets into similar
question groups gives a method for filling in the missing observations in $M$.

We construct a coupled geometry of $\Omega$ using the algorithm developed in [5]. The initial
affinity is given by the cosine affinity matrix $A$, and the iterative procedure is updated using
Earth Mover Distance.

Remark: Let the final affinity matrix be called $\hat{A} : M \times M \to [0,1]$. Let the
eigenpairs of $\hat{A}$ be called $\{(\lambda_i, \phi_i)\}$ with
$1 = \lambda_0 \geq \cdots \geq \lambda_N$. Then the organization of $M$ is generated by

\[ \Phi^t(x) = [\lambda_1^t \phi_1(x), \ldots, \lambda_d^t \phi_d(x)], \quad x \in M, \]

where $d$ is the dimension of the underlying manifold.


2.2 Filling in Missing Features 


Let $\mathcal{T}_{obs}$ be the hierarchical tree developed on the observations in $\Omega$ from
Section 2.1. Let the levels be $\mathcal{X}^1, \ldots, \mathcal{X}^L$, with the nodes for level
$l$ named $\mathcal{X}^l_1, \ldots, \mathcal{X}^l_{n(l)}$. Let $v_i \notin \Omega$ be a data
point with the entry $v_{i,k}$ missing. In order to add $v_i$ into the geometry of $\Omega$,
we must estimate the entries in $(\mathrm{supp}(v_i))^c$ to calculate an affinity between $v_i$
and other points.

$\mathcal{T}_{obs}$ gives a tree of correlations between the observations. This allows us to
fill in $v_{i,k}$ with similar, known entries. Find the lowest level of the tree (most strongly
correlated questions) for which observation $k \in \mathcal{X}^l_j$ and
$\exists m \in \mathcal{X}^l_j$ such that $v_{i,m}$ is known. Then the estimate of $v_{i,k}$
satisfies

\[ v_{i,k} = \frac{1}{|\{m \in \mathcal{X}^l_j : v_{i,m} \text{ is known}\}|}
   \sum_{m \in \mathcal{X}^l_j,\ v_{i,m} \text{ known}} v_{i,m}. \tag{3} \]

Along with an estimate of $v_{i,k}$, (3) also gives a level of uncertainty for the estimate, as
smaller $l$ (i.e. coarser folders) have lower correlation and give larger reconstruction error.
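
The search through the tree can be sketched as follows. This is a hedged illustration, not the authors' code: the representation of the tree as a list of partitions (coarsest first), the use of `None` for missing values, and the choice of a plain average over the known entries in the folder are assumptions of the sketch.

```python
def fill_missing(v, k, levels):
    """Estimate the missing entry v[k] from known entries in the same folder.

    levels: list of partitions of the observation indices, ordered from the
    coarsest (index 0) to the finest; each partition is a list of sets.
    Returns (estimate, level), or (None, None) if nothing is known.
    """
    for level in range(len(levels) - 1, -1, -1):  # try the finest level first
        folder = next(f for f in levels[level] if k in f)
        known = [v[m] for m in folder if m != k and v[m] is not None]
        if known:
            # The level acts as a rough uncertainty flag: coarser levels
            # (smaller index) group less correlated observations.
            return sum(known) / len(known), level
    return None, None

levels = [
    [{0, 1, 2, 3}],    # level 0: a single coarse folder
    [{0, 1}, {2, 3}],  # level 1: finer folders
]
v = [4.0, 2.0, None, None]
print(fill_missing(v, 2, levels))
```

In the example, the fine folder containing observation 2 has no known entry, so the estimate falls back to the coarse folder and averages the two known values there.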


2.3 Expert Driven Function on the Folders 


Let $\mathcal{T}_{points}$ be the hierarchical tree developed on the data points in $\Omega$
from Section 2.1. Let the levels be $\mathcal{X}^1, \ldots, \mathcal{X}^L$, with the nodes for
level $l$ named $\mathcal{X}^l_1, \ldots, \mathcal{X}^l_{n(l)}$. As the partitioning
becomes finer (i.e. $l$ approaches $L$), the folders contain more tightly clustered points. This
means that the distance from a point to the centroid of its folder becomes smaller as the
partitioning becomes finer.

Fix the level $l$ in the tree. The centroids of these folders can be thought of as “types” of
data points, or pseudopoints. There are two major benefits: there are a small number of
pseudopoints relative to $n$ that span the entire data space, and the pseudopoints are less
noisy and more robust to erroneous observations.

These pseudopoints are the key to incorporating expert knowledge and opinion. The
pseudopoints are easier and much faster to classify than individual points, as there are a small
number and they are less noisy than individual points. Also, the pseudopoints effectively
synthesize the aggregate performance of multiple hospitals. The classifications generated
by experts can be varied, anything from quality rankings to discrete classes to several
descriptive features or “meta-features” of the bins. Specifically, the user assigns a set of classes
$\mathcal{C}$ and a classification function $g : \Omega \to \mathcal{C}$ such that

\[ \forall x \in \mathcal{X}^l_j, \quad g(x) = c_j \in \mathcal{C}. \tag{4} \]

This function is understood as a rough estimate, since the classification is applied to all
$x \in \mathcal{X}^l_j$ even though the class is determined only from the centroid of
$\mathcal{X}^l_j$.

This step is non-traditional in unsupervised classification algorithms. Traditionally, the
clustering is done agnostic to the final goal of classification. However, this does not allow for
a second step correction of the original classification. By allowing a user driven component
that quickly examines the cluster centroids, we are able to learn an initial classification
map for the points. This provides a rough characterization of which clusters should be
collapsed due to similar classification scores, which clusters should be separated further due
to drastically different classification scores, and a method of determining class labels for
points on the border of multiple clusters.

This function gives a rough metric $\rho : \Omega \times \Omega \to \mathbb{R}^+$ that has
dependencies of the form

\[ \rho(x,y) = f(x - y, g(x), g(y)), \quad
   f : \mathbb{R}^m \times \mathcal{C} \times \mathcal{C} \to \mathbb{R}^+. \tag{5} \]

This metric needs to satisfy two main properties:

1. $\exists h_0$ such that $\rho(x,y) < \epsilon_{dist}$ if $\|x - y\| < h$ for $h < h_0$ and
   some norm $\| \cdot \|$, and

2. $\rho(x,y) > \epsilon_{class}$ if $g(x) \neq g(y)$.

A metric that satisfies these two properties naturally relearns the most important features
for preserving clusters while simultaneously incorporating expert knowledge to collapse
non-relevant features. We learn this function using neural nets, as described in Section 3.

3 Deep Learning to Form Meta-Features 

There are entire classes of functions that approximate the behavior of $\rho$ in (5). For our
algorithm, we use neural nets for several reasons. First, the weight vectors on the first layer of
a neural net have a clear, physical interpretation, as the weight matrix can be thought of as a
non-linear version of the weight vectors from principal component analysis. Second, current
literature on neural nets suggests a need for incredibly large datasets to develop meaningful
features on unsupervised data. Our algorithm provides a way to turn an unlabeled data set
into a semi-supervised algorithm, and incorporates this supervision into the nodes of the
neural net. This supervision appears to reduce the number of training points necessary to
generate a non-trivial organization. Past literature has used radial basis functions as the
non-linear activation for deep learning, as well as for analysis of the structure of deep
net representations [12].

3.1 Neural Nets with Back-Propagation of Rankings 

For our algorithm, we build a 2 layer stacked autoencoder with a sigmoid activation function,
with a regression layer on top. The hidden layers are defined as

\[ h^{(l)}(x) = \sigma\left(b^{(l)} + W^{(l)} h^{(l-1)}(x)\right), \]

with $\sigma : \mathbb{R} \to [0,1]$ being a sigmoid function applied element-wise, and
$h^{(0)}(x) = x$. The output function $f(x)$ is logistic regression of these activation units

\[ f(x) = \sigma\left(c + V h^{(2)}(x)\right). \]

The reconstruction cost function for training our net is an $\ell^2$ reconstruction error

\[ C = \frac{1}{n} \sum_{i=1}^{n} \|g(x_i) - f(x_i)\|^2. \tag{6} \]

The overall loss function we minimize, which combines the reconstruction cost with several 
bounds on the weights, is 

i=l I i,j 

Note that we rescale $g$ if it takes values outside $[0,1]$. We then backpropagate the error by
calculating the gradients of the loss with respect to the weights and biases, and adjusting
the weights and biases accordingly. See [11] for a full description of the algorithm.
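
As a minimal sketch of the forward pass and the cost described in this section, assuming plain Python lists for the weights and hypothetical names (`W1`, `b1`, `W2`, `b2`, `V`, `c`) that are not taken from the source, one could write:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def layer(W, b, h):
    """One hidden layer: sigmoid(b + W h), applied element-wise."""
    return [sigmoid(b_j + sum(w * x for w, x in zip(row, h)))
            for row, b_j in zip(W, b)]

def forward(x, W1, b1, W2, b2, V, c):
    """Two sigmoid hidden layers followed by a logistic regression output."""
    h1 = layer(W1, b1, x)
    h2 = layer(W2, b2, h1)
    return sigmoid(c + sum(v * h for v, h in zip(V, h2)))

def cost(xs, gs, params):
    """Mean squared error between expert scores g(x_i) in [0, 1] and f(x_i)."""
    return sum((g - forward(x, *params)) ** 2 for x, g in zip(xs, gs)) / len(xs)

# Toy parameters (all zeros), so every sigmoid outputs exactly 0.5.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = (zeros, [0.0, 0.0], zeros, [0.0, 0.0], [0.0, 0.0], 0.0)
x = [1.0, 2.0]
print(forward(x, *params), cost([x, x], [1.0, 0.0], params))
```

The sketch covers only the data-fit term; the weight penalties in the overall loss and the backpropagation step are omitted.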

Definition 3.1. The deep neural net metric on $\Omega$ with respect to an external function $f$
is defined as

\[ \rho_{DNN}(x,y) = \|h^{(1)}(x) - h^{(1)}(y)\|. \]




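Lemma 3.2. A deep neural net with a logistic regression on top generates a metric $\rho_{DNN}$
that satisfies Condition 1 from (5) with a Lipschitz constant of $\|W^{(1)}\|/4$ with respect to
Euclidean distance. The output function $f$ also has a Lipschitz constant of
$\|W^{(1)}\| \, \|W^{(2)}\| \, \|V\|/64$ with respect to Euclidean distance.

Proof. Let $\sigma(x) = \frac{1}{1 + e^{-x}}$. Then
$\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x)) \leq \frac{1}{4}$. By the mean value theorem,
$|\sigma(a) - \sigma(b)| = |a - b| \left|\frac{d\sigma}{dx}(z)\right| \leq \frac{|a - b|}{4}$
for some $z$ between $a$ and $b$. Then

\begin{align*}
\|f(x) - f(y)\| &\leq \|V\| \, \|h^{(2)}(x) - h^{(2)}(y)\|/4 \\
&\leq \|V\| \, \|W^{(2)} h^{(1)}(x) - W^{(2)} h^{(1)}(y)\|/16 \\
&\leq \|V\| \, \|W^{(2)}\| \, \|h^{(1)}(x) - h^{(1)}(y)\|/16 \\
&\leq \|V\| \, \|W^{(2)}\| \, \|W^{(1)}\| \, \|x - y\|/64.
\end{align*}

The same argument applies for $\rho_{DNN}$.

□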
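
The key step of the proof, that the sigmoid is Lipschitz with constant 1/4, can be checked numerically. The following is only an empirical sanity check under assumed sampling choices (10,000 random pairs drawn uniformly from [-5, 5]), not a substitute for the argument above.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def worst_ratio(num_pairs=10000, seed=0):
    """Largest observed |sigmoid(a) - sigmoid(b)| / |a - b| over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_pairs):
        a, b = rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)
        if a != b:
            worst = max(worst, abs(sigmoid(a) - sigmoid(b)) / abs(a - b))
    return worst

# The observed ratio should approach, but never exceed, 1/4.
print(worst_ratio() <= 0.25)
```

Each sigmoid layer contributes one such factor of 1/4, which is where the denominators 4, 16 and 64 in the chain of inequalities come from.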

Lemma 3.3. A two layer neural net with a logistic regression on top creates a function $f$
which satisfies a variant of Condition 2 from (5), namely that

\[ E_S\left(\|f(x) - f(y)\|^2\right) \geq E_S\left(\|g(x) - g(y)\|^2\right)
   - 2 \frac{n \max_i S_i}{S} C, \tag{7} \]

where $S = \#\{(x,y) \in \Omega \times \Omega : g(x) \neq g(y)\}$,
$S_i = \#\{y \in \Omega : g(y) \neq i\}$, and $E_S$ is the expected value over the set $S$.

Moreover, the deep neural net metric $\rho_{DNN}$ generated also satisfies


E^{\\fix)-fiy)f) > 



E^ {\\g{x) - g{y)f) 


- 2 




S 


( 8 ) 


Proof. We have

\begin{align*}
\|f(x) - f(y)\|^2 &= \|f(x) - g(x) - f(y) + g(y) + g(x) - g(y)\|^2 \\
&\geq \|g(x) - g(y)\|^2 - \left(\|f(x) - g(x)\|^2 + \|f(y) - g(y)\|^2\right).
\end{align*}

Unfortunately, because (6) is a global minimization, we cannot say anything meaningful
about the difference for individual points. However, we do have

\begin{align*}
\sum_{g(x) \neq g(y)} \|f(x) - f(y)\|^2
&\geq \sum_{g(x) \neq g(y)} \|g(x) - g(y)\|^2
   - \sum_{g(x) \neq g(y)} \left(\|f(x) - g(x)\|^2 + \|f(y) - g(y)\|^2\right) \\
&= \sum_{g(x) \neq g(y)} \|g(x) - g(y)\|^2
   - 2 \sum_{x} \#\{y : g(y) \neq g(x)\} \cdot \|f(x) - g(x)\|^2 \\
&\geq \sum_{g(x) \neq g(y)} \|g(x) - g(y)\|^2
   - 2 \left(\max_i \#\{y : g(y) \neq i\}\right) \cdot nC,
\end{align*}

where $\#\{y : g(y) \neq i\}$ denotes the number of elements in this set. Let
$S = \#\{(x,y) \in \Omega \times \Omega : g(x) \neq g(y)\}$ and
$S_i = \#\{y \in \Omega : g(y) \neq i\}$. Then

\[ E_S\left(\|f(x) - f(y)\|^2\right) \geq E_S\left(\|g(x) - g(y)\|^2\right)
   - 2 \frac{n \max_i S_i}{S} C. \]

This means that by minimizing $C$, we are forcing the separation of points with different
initial ranking to be as large as possible. This enforces Condition 2 of (5) in the
aggregate over all such points.

The scaling in (8) for $\rho_{DNN}$ is a simple application of Lemma 3.2.

□


3.2 Heat Kernel Defined by $\rho_{DNN}$


The weights generated by the neural net represent “meta-features” formed from the features
on $\Omega$. Each hidden node generates important linear and non-linear combinations of the
data, and contains much richer information than a single question or average over a few
questions.

One downside of traditional deep learning is that neural nets only account for the geometry
of the observations. They ignore the geometry and clustering of the data points, opting to
treat every point equally by using a simple mean of all points in the cost function (6). This
can easily miss outliers and be heavily influenced by large clusters, especially with a small
number of training points.

The questionnaire from Section 2.1 organizes both the observations and the data points,
though the meta-features from the questionnaire are simply averages of similar features as
in (3). However, the expert ranking assigned to each bin reflects a rough initial geometry
onto the points to be learned by the neural net.


The back propagation of the expert driven function is essential to building significant weight
vectors in this regime of small datasets. When the ratio of the number of data points to the
dimension of the features is relatively small, there are not enough training points to learn
the connections between all the features without considering the points themselves. The
back propagation of the classification function generated from the questionnaire is a way to
enforce the initial geometry of the data on the weight vectors. The hospital ranking example
in Section 4 demonstrates the need for back propagation, and Figure 7 demonstrates the fact
that the features from a simple SDAE are not sufficient for separating data points.

For our algorithm, the SDAE is trained on $\Omega$, the subset of points with “full” collection
of features. This is to avoid training on reconstructed features which are subject to
reconstruction error from Section 2.2.

Another problem with neural nets is that they can be highly unstable under parameter
selection (or even random initialization). Two identical iterations can lead to completely
different sets of weights. Along with that, back propagation can force points into isolated
corners of the unit cube.

For this reason, we rerun the neural net $K$ times with varied random seeds, number of
hidden layers, sparsity parameters, and dropout percentages. After $K$ iterations, we build
the new set of features on points as $\Omega^*(x) = [h_1^{(1)}(x), \ldots, h_K^{(1)}(x)]$. This
defines an adjacency matrix $A$ on $\Omega$ with affinity defined between two points as

\[ A(x,y) = e^{-\|\Omega^*(x) - \Omega^*(y)\|^2/\sigma}. \tag{9} \]

Along with that, the final ranking function on $M$ comes from
$f(x) = \frac{1}{K} \sum_{i=1}^{K} f_i(x)$. Note that
$\|\Omega^*(x) - \Omega^*(y)\|^2 = \sum_{i=1}^{K} \rho_{DNN,i}(x,y)^2$.

The expert driven heat kernel defined in (9) generates an embedding
$\Phi^t : \Omega \to \mathbb{R}^d$ via the eigenvectors
$\Phi^t(x) = [\lambda_1^t \phi_1(x), \ldots, \lambda_d^t \phi_d(x)]$.

For each neural net $h_i$, we keep the number of hidden layers small relative to the dimension
of the data. This keeps the net from overfitting the data to the initial organization function
$g$.
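
The way the $K$ runs are combined can be illustrated with a short sketch. The hidden representations below are made-up numbers and the helper names are hypothetical; the point is only that concatenating the per-net representations makes the squared distance between two points equal to the sum of the squared per-net distances, and that the final ranking is a plain average of the per-net outputs.

```python
import math

def rho(hx, hy):
    """Euclidean distance between two hidden representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hx, hy)))

def concat(reps):
    """Stack the representations from all K nets into one feature vector."""
    return [val for rep in reps for val in rep]

def average_ranking(scores):
    """Final ranking: the mean of the K per-net outputs f_i(x)."""
    return sum(scores) / len(scores)

# Hypothetical hidden representations of x and y from K = 2 nets.
reps_x = [[0.1, 0.9], [0.4, 0.4, 0.7]]
reps_y = [[0.3, 0.5], [0.4, 0.1, 0.7]]

lhs = rho(concat(reps_x), concat(reps_y)) ** 2
rhs = sum(rho(hx, hy) ** 2 for hx, hy in zip(reps_x, reps_y))
print(lhs, rhs, average_ranking([0.2, 0.4, 0.6]))
```

The nets may have different widths, as in the example, since only the concatenated vector enters the distance.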

3.3 Standardizing Distances to Build an Intrinsic Embedding 

While this generates a global embedding based off local geometry, it does not necessarily
generate a homogeneous space. In other words,
$\|\Phi^t(x) - \Phi^t(y)\| = \|\Phi^t(x') - \Phi^t(y')\|$ does not necessarily guarantee that
$x$ and $y$ differ by the same amount as $x'$ and $y'$. This is because $\Omega^*(x)$ and
$\Omega^*(y)$ may differ in a large number of deep net features, whereas $\Omega^*(x')$ and
$\Omega^*(y')$ may only differ in one or two features (though those features may be incredibly
important for differentiation).

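For this reason, we must consider a local z-score of the regions of the data. For each point
$\Phi^t(x)$, there exists a mean and covariance matrix within a neighborhood $U_x$ about $x$
such that

\[ \mu_x = \frac{1}{|U_x|} \sum_{z \in U_x} \Phi^t(z), \qquad
   \Sigma_x = \frac{1}{|U_x|} \sum_{z \in U_x}
   \left(\Phi^t(z) - \mu_x\right)\left(\Phi^t(z) - \mu_x\right)^T. \]

This generates a new whitened distance metric

\[ d_t(x,y) = \frac{1}{2} \left[(\Phi^t(x) - \mu_x) - (\Phi^t(y) - \mu_y)\right]^T
   \left(\Sigma_x^\dagger + \Sigma_y^\dagger\right)
   \left[(\Phi^t(x) - \mu_x) - (\Phi^t(y) - \mu_y)\right], \tag{10} \]

where $\Sigma_x^\dagger$ is the Moore-Penrose pseudoinverse of the covariance matrix.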
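
A simplified sketch of the locally standardized distance follows. To stay dependency-free it assumes diagonal local covariances, so the pseudoinverse reduces to inverting the non-zero variances; the leading factor of one half and the 1/n normalization of the variance are likewise assumptions of the sketch. A full implementation would use the complete covariance matrix and a numerical pseudoinverse.

```python
def local_stats(points):
    """Per-coordinate mean and variance of the points in one neighborhood."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    var = [sum((p[j] - mean[j]) ** 2 for p in points) / n for j in range(d)]
    return mean, var

def pinv_diag(var, tol=1e-12):
    """Pseudoinverse of a diagonal covariance: invert only non-zero entries."""
    return [1.0 / s if s > tol else 0.0 for s in var]

def whitened_distance(px, py, nbhd_x, nbhd_y):
    """Distance between centered points, weighted by both local inverse variances."""
    mu_x, var_x = local_stats(nbhd_x)
    mu_y, var_y = local_stats(nbhd_y)
    inv_x, inv_y = pinv_diag(var_x), pinv_diag(var_y)
    diff = [(a - m) - (b - k) for a, m, b, k in zip(px, mu_x, py, mu_y)]
    return 0.5 * sum(d * d * (u + w) for d, u, w in zip(diff, inv_x, inv_y))

# A tight neighborhood and a spread-out one, both flat in the second coordinate.
nbhd_x = [[0.0, 0.0], [2.0, 0.0]]
nbhd_y = [[10.0, 0.0], [14.0, 0.0]]
print(whitened_distance([2.0, 0.0], [14.0, 0.0], nbhd_x, nbhd_y))
```

Coordinates with zero local variance are ignored rather than producing a division by zero, which is the role the pseudoinverse plays in the full matrix version.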


One can generate a final, locally standardized representation of the data via the diffusion
kernel

\[ W(x,y) = e^{-d_t(x,y)/\sigma}, \]

and the low frequency eigenvalues/eigenvectors of $W$, which we call $\{(s_i, \psi_i)\}$. The
final representation is denoted

\[ \Psi^t(x) = [s_1^t \psi_1(x), \ldots, s_d^t \psi_d(x)]. \tag{11} \]


3.4 Extension to New Points 


Once meta-features have been built, extending the embedding to the rest of $M$ can be done
via an asymmetric affinity kernel, as described in [9]. For each $x \notin \Omega$, one can
calculate $\Omega^*(x) = [h_i^{(1)}(x)]_{i=1}^{K}$ and $f_i(x)$ with the weights generated in
Section 3.2. This defines an affinity $a(x,y)$ between $x$ and $y \in \Omega$.

The extension of the eigenvectors $\phi_i$ of $A$ from (9) to the rest of $M$ is then given by

\[ \tilde{\phi}_i = \frac{1}{\lambda_i^{1/2}} \bar{A} \phi_i, \]

where $\bar{A}$ is the row stochastic normalization of $a$.

Similarly, once the embedding $\Phi^t$ is extended to all points in $M$, we can extend the
final, standardized representation $\Psi^t$ to the rest of $M$ as well using the affinity matrix

\[ b(x,y) = \exp\left\{ -\frac{1}{2} \left[(\Phi^t(x) - \mu_y) - (\Phi^t(y) - \mu_y)\right]^T
   \Sigma_y^\dagger \left[(\Phi^t(x) - \mu_y) - (\Phi^t(y) - \mu_y)\right] / \sigma \right\}, \]

for $x \in M$ and $y \in \Omega$. Once again, the eigenvectors $\psi_i$ in (11) can be extended
similarly via

\[ \tilde{\psi}_i = \frac{1}{s_i^{1/2}} \bar{B} \psi_i, \]

where $\bar{B}$ is the row stochastic normalized version of $b$.

In this way, $\Psi^t$ can be defined on all points $M$ after analysis on the reference set
$\Omega$.


3.5 Different Approach to Deep Learning 


Our algorithm, as we will discuss in detail in Section 4, uses the meta-features from stacked
neural nets in a way not commonly considered in the literature. Most algorithms use back
propagation of a function to fine-tune weight matrices and improve the accuracy of the
one dimensional classification function. However, in our algorithm, the purpose of the back
propagation is not to improve classification accuracy, but instead to organize the data in
a way that is smooth relative to the classification function. In fact, we are most
interested in the level sets of the classification function and understanding the organization
of these level sets.

At the same time, we are not building an auto-encoder that pools redundant and correlated
features in an attempt to build an accurate, lossy compression (or expansion) of the data.
Due to the high level of disagreement among the features, non-trivial features generated
from an auto-encoder are effectively noise, as we see in Section 4.2. This is the motivation
behind propagating an external notion of “quality”.


4 Expert Driven Embeddings for Hospital Rankings 

4.1 Hospital Quality Ranking 

For preprocessing, every feature is mean centered and standard deviation normalized. Also,
some of the features are posed in such a way that low scores are good, and others are posed
such that low scores are bad. For this reason, we “de-polarize” the features using principal
component analysis. Let $M = [x_1, \ldots, x_N]$ be the data matrix. Then let $U$ be the largest
eigenvector of the covariance matrix

\[ \begin{bmatrix} M \\ -M \end{bmatrix} \begin{bmatrix} M^* & -M^* \end{bmatrix}, \]

and take the half of the features $i$ for which $U_i > 0$. This makes scores above the mean
“good” regardless of whether the feature is posed as a positive or negative question.
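
The de-polarization step can be sketched as below. The sketch relies on the observation that the leading eigenvector of the stacked matrix has the form $[u; -u]$, where $u$ is the leading eigenvector of $M M^*$, so choosing between feature $i$ and its negated copy amounts to flipping the features with $u_i < 0$. Power iteration, the starting vector, and the helper names are choices made for this illustration only; as with any eigenvector, the overall sign of $u$ is arbitrary.

```python
import math

def top_eigenvector(C, iters=200):
    """Leading eigenvector of a symmetric PSD matrix via power iteration."""
    n = len(C)
    u = [1.0 + 0.1 * i for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        w = [sum(C[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        u = [t / norm for t in w]
    return u

def depolarize(rows):
    """rows[i] holds feature i across hospitals, already mean centered."""
    m = len(rows)
    C = [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(m)]
         for i in range(m)]
    u = top_eigenvector(C)
    # Keep feature i when u_i > 0, otherwise use its negated copy.
    return [row if u_i > 0 else [-t for t in row] for row, u_i in zip(rows, u)]

rows = [
    [1.0, -1.0, 2.0, -2.0],   # "high is good"
    [-1.0, 1.0, -2.0, 2.0],   # same signal posed as "low is good"
    [0.5, -0.5, 1.0, -1.0],
]
aligned = depolarize(rows)
print(aligned[1])
```

After the flip, all three toy features point in the same direction, which is the property the preprocessing is meant to enforce.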
We begin by building a questionnaire on the hospitals. Our analysis focuses on the 2014
CMS measures. We put a 2x weight on mortality features and a 1.5x weight on readmission
features, because outcome measures, which describe the tangible results of a hospitalization,
are particularly important for patients. The questionnaire learns the relationships
between the hospitals, as well as the relationships between the different features.
In doing this, we are able to build a partition tree of the hospitals, in which hospitals in
the same node of the tree are more similar than hospitals in different nodes on the same
level of the tree. As a side note, the weighting on mortality features only guarantees that
the intra-bin variance of the mortality features is fairly low.

For the ranking in this example, we use the 5th layer of a dyadic questionnaire tree, which
gives 32 bins and pseudohospitals. Experts on CMS quality measures rank these
pseudohospitals on multiple criteria and assign a quality score between 1 and 10 to each
pseudohospital. We use an average of 50 nodes per layer of the neural net. Also, we average
the results of the neural nets over 100 trials. We shall refer to the final averaged ranking as
the deep neural net ranking (DNN ranking) to avoid confusion.


4.2 Two Step Embedding of Hospitals 


Figure 4 shows the embedding of the subset of hospitals used in training the neural net.
It is colored by the quality function assigned to those hospitals. The various prongs of the
embedding are explained in Section 4.3.4. Figure 5 shows the embedding of the full set of
hospitals. This was generated via the algorithm in Section 3.4.



Figure 4: Embedding of hospitals with at most $\eta$ missing entries. Red corresponds to top
quality, blue to bottom quality.

Figure 5: Embedding of full set of hospitals. Red corresponds to top quality, blue to bottom
quality.

Finally, while the goal of the hospital organization is to determine neighborhoods of similar
hospitals, it is also necessary for all hospitals in a shared neighborhood to share a common
quality rating. Figure 6 plots the quality function assigned to each hospital against the
weighted average quality function of its neighborhood, where the weights come from the
normalized affinities between the given hospital and its neighbors. The strong collinearity
demonstrates that the assigned quality function is consistent within neighborhoods of
similar hospitals.
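
The neighborhood average used for this comparison can be sketched as a normalized, affinity-weighted mean. Excluding the hospital itself from its own neighborhood is an assumption of this sketch, as are the function and variable names.

```python
def neighborhood_average(quality, affinity_row, self_index):
    """Affinity-weighted average quality of the neighbors of one hospital.

    affinity_row[j] is the affinity between the given hospital and hospital j.
    """
    weights = [a if j != self_index else 0.0 for j, a in enumerate(affinity_row)]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, quality)) / total

quality = [1.0, 3.0, 5.0]
affinity_row = [1.0, 0.5, 0.5]  # affinities of hospital 0 to hospitals 0, 1, 2
print(neighborhood_average(quality, affinity_row, 0))
```

Plotting each hospital's own score against this value gives the kind of scatter described above.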



Figure 6: Quality function assigned for hospital versus average quality across its
neighborhood in the embedding.

To demonstrate that the expert input back propagation is necessary for a viable ranking
and affinity, we include Figure 7. Here, we build the same diffusion map embedding, but
on the features of the autoencoder before back propagation of the expert input function.
Due to the small number of data points relative to the number of features, an untuned
autoencoder fails to form relevant meta-features for the hospitals.

Figure 7: Embedding of $\Omega$ without back propagation of expert driven hospital rankings.
Coloring comes from quality rating function. Notice that without back propagation the
embedding is effectively noise compared to hospital quality.


4.3 Internal Validation 

The problem with validating this composite rating of hospital quality is the lack of external
ground truth. For this reason, we must rely on internal validation mechanisms to
demonstrate consistency. There are two qualities necessary for a ranking mechanism: the affinity
between hospitals cannot disagree too drastically with the original features, and similar
hospitals must be of similar quality.

For the rest of the section, we refer to several simple meta-features for comparison and
validation. Row sum ranking refers to a simple ranking function

\[ f_{\text{row-sum}}(x) = \sum_i x_i, \]

where $\{x_i\}$ are the normalized and depolarized features for a given hospital $x$. Average
process, survey, and outcome features are the same as the row sum score, but restricted
only to features in the given category. NNLS weighted ranking refers to a function

\[ f_{\text{NNLS}}(x) = \sum_i w_i x_i, \]

where the weights are calculated using non-negative least squares across the normalized and
depolarized features, with the DNN ranking as the dependent variable. NNLS weighted
process, survey, and outcome features are the same as the NNLS weighted score, but restricted
only to features in the given category. NNLS was used for calculating the weights to avoid
overfitting the DNN ranking by using negative weights.

Also, our focus (unless otherwise stated) is on the subset of hospitals $\Omega$ that have more
than 90% of features reported. This is because a number of our validation steps use the row
sum ranking and NNLS weighted ranking, which are more accurate when calculated over
non-missing entries.


4.3.1 DNN Satisfies Conditions for p 

Figure 8 verifies that the back propagation neural net satisfies Condition 1 of p, as defined earlier. 
The plot shows the ranking of each hospital plotted against the average ranking of its ten nearest 
neighbors under Euclidean distance between hospital profiles. The fact that the average 
strongly correlates with the original quality rating shows that the embeddings of the hospitals 
remain close if they are close under Euclidean distance. 
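
The nearest neighbor check described above can be sketched as follows. This is only an illustration under assumptions: hospital profiles are given as equal-length numeric tuples with no missing entries, and agreement is summarized with a Pearson correlation, which the text does not specify. The function names are hypothetical.

```python
import math


def knn_average_quality(profiles, quality, k=10):
    """For each point, average the quality of its k nearest Euclidean neighbors."""
    averages = []
    for i, p in enumerate(profiles):
        # Distances to every other point, sorted ascending.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(profiles) if j != i)
        neighbors = [j for _, j in dists[:k]]
        averages.append(sum(quality[j] for j in neighbors) / len(neighbors))
    return averages


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# Toy example: two well separated groups with distinct quality values.
profiles = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
quality = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
neighbor_avg = knn_average_quality(profiles, quality, k=2)
r = pearson(quality, neighbor_avg)
```

A correlation near 1 on real data would indicate that points close in profile space receive similar quality values.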



Figure 8: Quality function for hospital versus average quality function across closest 
Euclidean points. Demonstrates first condition necessary for p. 

Figure 9 shows that the back propagation neural net satisfies Condition 2 of p. The 
histogram on the left shows the DNN affinity between hospitals with different initial quality 
ratings (i.e. when $g(x) \neq g(y)$). To satisfy Condition 2, $g(x) \neq g(y) \implies p(x,y) > \epsilon$ 
(i.e. $A(x,y) < 1 - \epsilon$). Also, for the histogram on the right we define 

$$P_{\neq}(t) = \Pr\left(A(x,y) > t : g(x) \neq g(y)\right), \qquad P_{=}(t) = \Pr\left(A(x,y) > t : g(x) = g(y)\right).$$

The histogram on the right shows the ratio $P_{\neq}(t) / P_{=}(t)$. The smaller the ratio as $t$ approaches 1, the 
stronger the influence of the initial ranking on the final affinity. 
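
A small sketch of how the two tail probabilities and their ratio might be estimated from an affinity matrix follows. It assumes a symmetric affinity matrix stored as nested lists and initial ratings stored as a list of labels; both the data layout and the function name are assumptions made for illustration.

```python
def tail_probability(affinity, labels, t, same):
    """Empirical Pr(A(x, y) > t) over pairs x != y.

    same=True restricts to pairs with g(x) == g(y),
    same=False restricts to pairs with g(x) != g(y).
    """
    hits = 0
    total = 0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if (labels[i] == labels[j]) == same:
                total += 1
                if affinity[i][j] > t:
                    hits += 1
    return hits / total if total else float("nan")


# Toy affinity: high within each label group, mostly low across groups.
labels = [1, 1, 2, 2]
affinity = [
    [1.0, 0.9, 0.2, 0.6],
    [0.9, 1.0, 0.1, 0.3],
    [0.2, 0.1, 1.0, 0.8],
    [0.6, 0.3, 0.8, 1.0],
]
p_diff = tail_probability(affinity, labels, t=0.5, same=False)
p_same = tail_probability(affinity, labels, t=0.5, same=True)
ratio = p_diff / p_same
```

In this toy case only one of the four cross-label pairs exceeds the threshold, so the ratio is well below 1.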




Figure 9: (left) Histogram of DNN affinity between two points with different initial quality 
ratings. (right) Normalized histogram of DNN affinity between two points with different 
initial quality ratings divided by affinity between two points with same initial quality rating. 
If the initial ranking were unimportant, the bar graph would be concentrated around 1. Both images 
demonstrate the second condition necessary for p. Note that, for scaling purposes, all 
values less than .1 have been removed from counting. 


Moreover, we can compare the contraction guaranteed by (|^ . For the hospital ratings, 

(||/(x) - f{y)f) = 2.44, E^ (||5(x) - giy)f) = 3.36, ^ = 1,22, C = 0.9959, 


which makes the right hand side of (|^ equal 2.15. 

Figure 10 indicates that the DNN metric defined earlier gives a better notion of small 
neighborhoods than a simple Euclidean metric. Each metric defines a transition probability $P(x,y)$. 
For each point $x$, the plot finds 

$$\min_{I \subset \Omega} |I| \quad \text{subject to} \quad \sum_{y \in I} P(x,y) > \frac{1}{2}. \tag{12}$$
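
The minimization in (12) can be solved greedily for each point: sort that point's transition probabilities in decreasing order and accumulate until more than half the mass is covered. The sketch below assumes each row of the transition matrix is available as a list of probabilities summing to 1; the function name is hypothetical.

```python
def half_mass_neighborhood_size(row, mass=0.5):
    """Smallest number of states whose transition probabilities sum to more than `mass`."""
    total = 0.0
    # Taking the largest probabilities first gives the smallest possible set.
    for count, p in enumerate(sorted(row, reverse=True), start=1):
        total += p
        if total > mass:
            return count
    return len(row)


concentrated = half_mass_neighborhood_size([0.4, 0.3, 0.2, 0.1])
uniform = half_mass_neighborhood_size([0.25, 0.25, 0.25, 0.25])
# Fraction of all points needed, comparable to the percentage plotted per point.
uniform_fraction = uniform / 4
```

A row concentrated on a few neighbors yields a small count, while a uniform row needs more than half of all points.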


It is important for (12) to be small, as that implies the metric generates tightly clustered 
neighborhoods. As one can see from Figure 10, the DNN metric creates much more tightly 
clustered neighborhoods than a Euclidean metric. Figure 11 gives summary statistics of 
these neighborhoods for varying diffusion scales. 

Figure 10: Percent of points necessary to sum half the total transition probability. Metrics 
are generated in the same fashion from a kernel $K(x,y)$ whose scale is set from the distances 
$\|x - y_x\|$, where $y_x$ is the 10th closest neighbor of $x$. 




Figure 11: Metrics are generated in the same fashion from a kernel $K(x,y)$ whose scale is set 
from the distances $\|x - y_x\|$, where $y_x$ is the $k$-th closest neighbor of $x$. For both DNN embedding 
and Euclidean embedding metrics, the plots are (left) the average percentage of neighbors 
needed to contain half the transition probability from a given point to its neighborhood, 
and (right) the standard deviation of the number of neighbors needed to contain half the 
transition probability from a given point to its neighborhood. 


Another positive characteristic of our DNN embedding is the reduced local dimension of 
the data. Figure 12 plots the eigenvalues of the diffusion kernels $A_t$, where the 
Markov chain that $A_t$ describes comes from either the Euclidean metric or the DNN metric. 
These eigenvalues give us information about the intrinsic dimension of the data, as small 
eigenvalues do not contribute to the overall diffusion. To compare the eigenvalues across 
different diffusion kernels, we normalize the eigenvalues by setting $t$ to the 
average time it takes to diffuse across the system. Cutting off the eigenvalues to determine 
the dimension is fairly arbitrary without a distinct drop off, but an accepted heuristic is to 
set the cutoff at 

$$\dim(A_t) = \max\{i \in \mathbb{N} : \lambda_i^t > 0.01\}.$$

With this definition of dimension, $\dim(A_t) = 7$ for the DNN metric embedding, whereas 
$\dim(A_t) = 14$ for the Euclidean metric embedding. 
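
The eigenvalue cutoff heuristic can be written in a few lines. This sketch assumes the eigenvalues are real, lie in [0, 1], and are indexed from 1 in decreasing order; whether the trivial eigenvalue equal to 1 is counted is an assumption here, not something the text settles. The function name is hypothetical.

```python
def diffusion_dimension(eigenvalues, t, threshold=0.01):
    """Largest 1-based index i, with eigenvalues sorted descending, such that lambda_i ** t > threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    dim = 0
    for i, lam in enumerate(ordered, start=1):
        if lam ** t > threshold:
            dim = i
    return dim


spectrum = [1.0, 0.9, 0.5, 0.1]
dim_short = diffusion_dimension(spectrum, t=1)  # every eigenvalue exceeds 0.01
dim_long = diffusion_dimension(spectrum, t=4)   # 0.1 ** 4 falls below the cutoff
```

Longer diffusion times suppress small eigenvalues faster, so the estimated dimension can only stay the same or shrink as `t` grows.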



Figure 12: Eigenvalues of the embedding of the heat kernel $A_t$ generated from 
the Euclidean metric (blue) and DNN metric (red). Each eigenvalue set is normalized by 
setting $t$ to the average time it takes to diffuse across the system. 


4.3.2 Examples of “Best” and “Worst” Hospitals 


Row sum ranking is a poor metric for differentiation between most hospitals, but it is able 
to distinguish the extremal hospitals. In other words, hospitals with almost all features 
considerably above the mean would have obviously been assigned a 10 within any ranking 
system, and hospitals with almost all features below the mean would have obviously been 
assigned a 1 by any ranking system. 


Let us take the 25 best and worst hospitals under the row sum ranking, and consider how 
well these correspond to the final DNN rankings from Section 4.2. The average quality 
function value across the top 25 row sum rankings is 8.03, and the average quality function 
value across the bottom 25 row sum rankings is 1.75. 

While these averages are indicative of the agreement between the DNN ranking and simple 
understandings of hospital quality, they are not perfect. For example, the top 25 row sum 
hospitals have an average DNN ranking of 8.03 because one of these hospitals has a DNN 
ranking of 4.84. This hospital has an interesting profile. Qualitatively, its process features 
and survey features are above average, but its outcomes are significantly worse than the 
mean. Even though only 16 of its 81 features are below the mean, the features for mortality 
from heart failure and pneumonia, and several measures of hospital associated infection, 
are all more than 1 standard deviation below the mean. As all five of those features are 
heavily weighted by the algorithm and deemed crucial by experts, the hospital is assigned 
a low ranking despite having a strong row sum ranking. 
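
The comparison between extremal row sum hospitals and their DNN ratings amounts to selecting the top and bottom k hospitals by row sum and averaging the quality function over each group. A minimal sketch, with hypothetical names and toy numbers rather than the hospital data:

```python
def extremal_average(row_sums, quality, k=25):
    """Average quality over the k highest and the k lowest row sum scores.

    Returns (top_average, bottom_average).
    """
    order = sorted(range(len(row_sums)), key=lambda i: row_sums[i])
    bottom, top = order[:k], order[-k:]

    def mean(indices):
        return sum(quality[i] for i in indices) / len(indices)

    return mean(top), mean(bottom)


# Toy data with five hospitals and k = 2.
row_sums = [5.0, 1.0, 3.0, 9.0, 7.0]
quality = [6.0, 2.0, 4.0, 9.0, 8.0]
top_avg, bottom_avg = extremal_average(row_sums, quality, k=2)
```

A single low quality value inside the top group pulls the top average down, which is the effect discussed for the hospital rated 4.84.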

Figure 13 shows the embeddings of hospitals without the neural net step. In other words, 
we simply weight the mortality and readmission features and calculate an affinity based 
off those feature vectors. There are two important details. First, the embedding is very 
unstructured, implying a lack of structure within the naive affinity matrix. Second, while 
the cloud doesn't necessarily reflect structure, it does give a simple understanding of local 
neighborhoods. 

Figure 13 also shows that our new notion of rankings reflects a naive understanding of 
hospital quality. Namely, local Euclidean neighborhoods maintain the same ranking. Also, 
the 10 best and worst hospitals under the row sum ranking are circled. As expected, they 
exist on the two extremes of the embedding cloud, and our rankings agree with their notion 
of quality. 


4.3.3 Dependence on Initial Rankings 


Another important feature of a ranking algorithm is its robustness across multiple runs. To 
demonstrate stability of the neural net section of the algorithm, we run multiple experiments 
with random parameters to test the stability of the quality function. Figure 14 shows the 
quality rating from one run of the algorithm (i.e. the average ranking after 100 iterations 
of the neural net) against the quality rating from a separate run of the algorithm. In both 
runs, 1,000 hospitals from D are randomly chosen for training the model, which is then tested 
on the other 614 hospitals in D. Figure 14 shows the rankings for all 1,614 hospitals across 
both runs. Given the strong similarity in both quality rating and overall organization, it is 
clear that the average over 100 iterations of a neural net is sufficient to decide the quality 
function. 
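
The bookkeeping behind this stability experiment can be sketched as below: a random 1,000 / 614 split of the 1,614 hospitals, an average of ratings over independently trained nets, and a simple discrepancy measure between two runs. The neural net training itself is omitted, and the mean absolute difference is an assumed summary statistic that the text does not name. All function names are hypothetical.

```python
import random


def train_test_split(n_items, n_train, seed):
    """Randomly split indices 0..n_items-1 into train and test index lists."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def average_over_nets(ratings_per_net):
    """Average each hospital's rating across nets (one inner list per net)."""
    n_nets = len(ratings_per_net)
    return [sum(column) / n_nets for column in zip(*ratings_per_net)]


def mean_abs_difference(run_a, run_b):
    """Mean absolute difference between the quality ratings of two runs."""
    return sum(abs(a - b) for a, b in zip(run_a, run_b)) / len(run_a)


train_idx, test_idx = train_test_split(1614, 1000, seed=0)
averaged = average_over_nets([[1.0, 2.0], [3.0, 4.0]])
gap = mean_abs_difference([1.0, 2.0], [2.0, 4.0])
```

A small gap between two independently seeded runs would support the stability claim.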


It is also important to examine the dependence our initial binning and ranking has on the 
final ranking of the hospitals. Clearly the initial binning is only meant to give approximate 
ranks, so a strong dependence on these rankings would be problematic. Table 1 shows the 
confusion matrix between the initial hospital rankings and the final rankings assigned from 
the neural net. The purpose of the neural net second step is to reclassify hospitals that are 
binned incorrectly due to spurious correlations. 

Figure 13: Left: Embedding of hospitals using Euclidean distance on profiles, with 2x 
weighting on mortality features and 1.5x weighting on readmission features. Coloring is 
value of the quality function described in Section 4.2, with top ratings being red and bottom 
ratings being blue. Right: Same embedding, with extremal points circled. Blue circles on 
10 hospitals with largest number of features above the mean, and red circles on 10 hospitals 
with fewest number of features above the mean. This is a rough characterization of “best” 
and “worst” hospitals. 

                        Final Rank
    First Rank      8     11      0      0
                   15    730    129      0
                    0     97    330    219
                    0      0     27     48

Table 1: Confusion matrix between initial ranking on hospital bins and final ranking from 
neural net. Rows correspond to the first rank and columns to the final rank. For simplicity, 
rankings have been rounded into quartiles for purposes of the confusion matrix. 
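
A confusion matrix of this kind can be produced by binning both ratings into four groups and counting pairs. The sketch below assumes the quartiles are four equal-width bins over the rating range 1 to 10, which is one plausible reading of "rounded into quartiles" and not confirmed by the text; the function names are hypothetical. The last lines use the counts reported in Table 1 to compute how many hospitals keep their bin.

```python
def quartile_bin(rating, low=1.0, high=10.0):
    """Map a rating in [low, high] to one of four equal-width bins, 0 through 3."""
    width = (high - low) / 4
    return min(3, max(0, int((rating - low) / width)))


def confusion_matrix(first, final):
    """counts[a][b] = number of items with first-rank bin a and final-rank bin b."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(first, final):
        counts[quartile_bin(a)][quartile_bin(b)] += 1
    return counts


toy = confusion_matrix([1, 3, 7, 10, 5], [2.0, 6.55, 3.4, 9.5, 5.0])

# Counts as reported in Table 1 (rows: first rank, columns: final rank).
table1 = [
    [8, 11, 0, 0],
    [15, 730, 129, 0],
    [0, 97, 330, 219],
    [0, 0, 27, 48],
]
total = sum(sum(row) for row in table1)
unchanged = sum(table1[i][i] for i in range(4))
```

The Table 1 counts sum to 1,614 hospitals, of which 1,116 stay in the same bin, so roughly 69% of hospitals keep their initial quartile.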

Let us examine a few of the hospitals that made the biggest jump. One of the hospitals 
increased from a rating of 3 to a DNN rating of 6.55. Several of its process features are well 
below average, with a couple 2 standard deviations below average, and its survey features 
are average to slightly below average. For this reason, it was immediately classified as a 
poor hospital. However, its readmission features across the board are above average, with 
readmission due to heart failure 1.5 standard deviations above average. Also, mortality due 
to heart failure is a standard deviation above average, and all features of hospital associated 
infection are above the mean as well. In the initial ranking of the pseudohospitals, the 
fact that there are 29 survey features as compared to 5 readmission features brought the 
weighting down, even when giving extra weight to the readmission features. After the 
neural net ranking, the ranking is improved to 6.55. 

Figure 14: (left) Original embedding colored by quality function, repeated from an earlier 
figure. (right) New embedding of 100 nets trained independently from first run. Embedding 
is finally rotated to match orientation of the original via [6]. Quality rating coloring the 
second embedding is generated independently of left embedding. 

Another hospital dropped from a rating of 7 to a DNN rating of 3.4. Its process features are 
all above average, and its readmission for heart failure is 1.5 standard deviations above the 
mean. Because readmission is weighted initially and there are a large number of process 
measures, the hospital was initially binned with better hospitals. However, its features for 
hospital associated infections, as well as mortality due to pneumonia and heart failure, are 
a standard deviation below the mean, and its average survey score is a standard deviation 
below the mean as well. For these reasons, the DNN ranking dropped significantly. 


4.3.4 Affinity Dependence on Individual Features 

As another validation mechanism, we consider the smoothness of each feature against the 
metric p. For each feature $f_i$, we estimate the Lipschitz constant $L_i$ such that 

$$|f_i(x) - f_i(y)| \leq L_i \|x - y\|_{DNN},$$

where $\|\cdot\|_{DNN}$ is simply the Euclidean distance in the diffusion embedding space of the 
DNN affinity matrix $A$. Once $f_i$ is normalized by $\mathrm{range}(f_i)$, $L_i$ gives a measure of the 
smoothness of $f_i$ under our new metric. 
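
One simple way to estimate such a constant from data is to take the largest observed ratio over all pairs of points, after normalizing the feature by its range. The sketch below makes that assumption explicit; the estimator actually used for the table of constants is not described in the text, and the function name is hypothetical.

```python
import math
from itertools import combinations


def lipschitz_estimate(embedding, values):
    """Largest ratio |f(x) - f(y)| / (range(f) * ||x - y||) over pairs of distinct points.

    `embedding` holds the diffusion embedding coordinates of each point and
    `values` holds the feature value f at each point.
    """
    span = max(values) - min(values)
    if span == 0:
        return 0.0  # a constant feature is perfectly smooth
    best = 0.0
    for (p, fp), (q, fq) in combinations(list(zip(embedding, values)), 2):
        distance = math.dist(p, q)
        if distance > 0:
            best = max(best, abs(fp - fq) / (span * distance))
    return best


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
feature = [0.0, 1.0, 2.0]
L = lipschitz_estimate(points, feature)
```

A pairwise maximum is sensitive to outliers and near-duplicate points, so a high quantile of the ratios could be substituted if a more robust estimate is wanted.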






Lipschitz Constant Under p 

    Feature                    Lipschitz Constant
    Quality Rating Function    0.124
    READM 30 HOSP WIDE         0.303
    MORT 30 HF                 0.346
    READM 30 HF                0.350
    MORT 30 AMI                0.371
    MORT 30 PN                 0.384
    READM 30 PN                0.398
    READM 30 AMI               0.430
    READM 30 HIP KNEE          0.440
    H HSP RATING 0 6           0.582
    H RECMND DN                0.595
    H COMP 1 SN                0.618
    H HSP RATING 9 10          0.620
    H COMP 3 SN                0.688
    H RECMND DY                0.691
    H COMP 1 A                 0.712
    H QUIET HSP SN             0.735
    VTE 5                      0.740
    PSI 90 SAFETY              0.758
    H COMP 2 SN                0.758
    H COMP 4 SN                0.761
    COMP HIP KNEE              0.762
    STK 8                      0.763
    ED 1b                      0.766
    OP 20                      0.789
    ED 2b                      0.795
    PSI 4 SURG COMP            0.807
    VTE 3                      0.807
    H RECMND PY                0.811
    H HSP RATING 7 8           0.831
    OP 21                      0.832
    STK 1                      0.834
    HAI 5                      0.838
    H CLEAN HSP SN             0.839
    H COMP 5 SN                0.844
    HAI 2                      0.846
    HAI 3                      0.868
    H COMP 6 Y                 0.870
    H COMP 6 N                 0.870
    HAI 1                      0.875
    STK 6                      0.889
    H COMP 4 A                 0.898
    PC 01                      0.904
    VTE 2                      0.906
    H COMP 1 U                 0.908
    OP 11                      0.913
    OP 9                       0.913
    H COMP 3 A                 0.918
    H COMP 2 A                 0.931
    OP 13                      0.934
    OP 18b                     0.935
    STK 10                     0.942
    H QUIET HSP A              0.946
    OP 10                      0.978
    H COMP 5 A                 0.996
    H CLEAN HSP A              1.023
    HF 3                       1.077
    AMI 2                      1.081
    HAI 6                      1.088
    SCIP INF 9                 1.102
    VTE 4                      1.106
    AMI 10                     1.133
    OP 6                       1.147
    H COMP 2 U                 1.186
    SCIP CAR                   1.192
    VTE 1                      1.221
    H CLEAN HSP U              1.225
    OP 22                      1.227
    H QUIET HSP U              1.258
    SPP                        1.264
    IMM 1A                     1.272
    IMM 2                      1.284
    STK 2                      1.288
    OP 7                       1.291
    H COMP 4 U                 1.325
    H COMP 3 U                 1.365
    HF 1                       1.430
    STK 5                      1.443
    PN 3b                      1.545
    H COMP 5 U                 1.566
    SCIP INF 2                 1.994
    SCIP INF 3                 2.001
    PN 6                       2.066
    SCIP INF 1                 2.131
    SCIP VTE                   2.151
    SCIP INF 10                3.333
    HF 2                       5.111


From the Lipschitz constants, we can conclude that the most influential features are the 
mortality and readmission scores, followed closely by survey scores and post surgical 
infection scores. While the process features have some smoothness, their Lipschitz 
constants are significantly larger than those of the outcome and some survey scores. 



Figure 15: Hospital Diffusion embedding colored by average across 34 process features. 


Almost all the hospitals in Figure 15 have an average process score within a half standard 
deviation of the mean, which is well within acceptable levels. For this reason, the process 
features are not as discriminating as some of the other features. However, the few 
hospitals that are well below the mean have middle to poor quality ratings. 


Figures 16 and 17 show the embedding of the hospitals colored by their average survey 
and average outcome features. These are more strongly correlated with the overall quality 
function. 



Figure 16: Hospital Diffusion embedding colored by average across 29 survey features. 


Figure 17: Hospital Diffusion embedding colored by average across 15 outcome features. 

Note that the average across all features from a category is not the ideal description of 
that category. For a given hospital, there can be significant variability within a category 
which would not be captured by the mean. The category average is only meant to 
demonstrate a general trend. Figure 18 shows a weighted average of the readmission and 
mortality features, with the weights determined by a least squares fit with the DNN 
quality function as a dependent variable. 






Figure 18: (left) Scatter plots of quality function vs least squares solution with mortality and 
readmission as independent variables and quality function as dependent variable, (right) 
Hospital Diffusion embedding colored by same least squares solution. 

We can also use these embeddings, along with Figure 19, to show why the embedding has 
the two prongs on the left side of the image among the well ranked hospitals. Notice from 
Figure 19 that readmission is very strong along the lower prong, while mortality scores are 
very strong along the upper prong. Moreover, there is some level of disjointness between 
the two scores |in) . 

This raises the question of why the strong readmission cluster is ranked higher than the 
strong mortality cluster. That can be explained in Figure 16. The survey features are 
much stronger along the lower cluster with high readmission than they are in the 
mortality cluster. This explains the cluster of hospitals receiving top quality ratings 
despite having average to slightly above average mortality scores. The same can be said, 
to a lesser degree, for the process scores in Figure 15. So the differentiation in rankings is 
derived from the fact that, at their highest levels, more of the features agree with 
readmission features than they do with mortality features. 

Figure 19: (left) Hospital Diffusion embedding colored by hospital wide readmission. (right) 
Hospital Diffusion embedding colored by mortality from heart failure. 


The left plot in Figure 20 represents each hospital with three features: the mean score 
across process features, the mean score across survey features, and the mean score across 
outcome features. The plot is colored by the quality function, and it is clear that the 
rankings reflect the trend of these three average features. The right plot in Figure 20 
represents each hospital with the NNLS process, survey, and outcome scores. Table 2 lists 
those features that have non-trivial weights with p-value < .05. 


5 Conclusion 

We introduced an algorithm for generating new metrics and diffusion embeddings based 
off of expert ranking. Our algorithm incorporates both data point geometry via 
hierarchical diffusion geometry and non-linear meta-features via stacked neural nets. The 
resulting embedding and rankings represent a propagation of the expert rankings to all 
data points, and the resulting metric generated by the stacked neural net gives a Lipschitz 
representation with respect to Euclidean distance that learns important and irrelevant 
features according to the expert opinions and in an automated fashion. 

Figure 20: (left) Plot of mean scores across process, survey, and outcome measures, colored 
by the quality function. (right) Plot of process, survey, and outcome measures with weights 
determined by non negative least squares with the quality function as dependent variable, 
colored by quality rating. 


Although the ranking algorithm seems tied to the process of expert rankings of hospitals, 
the underlying idea of generating metrics of the form defined earlier, and propagating first pass 
rankings to the rest of the data, is quite general. For example, this method could be used 
to propagate sparsely labeled data to the rest of the dataset, with expert bin rankings 
replaced by the mode of the labelled data in the bin. 

This method also touches on the importance of incorporating data point organization into 
the neural net framework when dealing with smaller data sets and noisy features. 
Without the expert driven function to roughly organize the data points, the stacked 
autoencoder fails to determine any relevant features for separation, as shown in Figure 7. 

We will examine the implications of our hospital ratings on health policy, as well as 
discuss the various types of hospitals in our embedding, in a future paper. Future work 
will also examine the influence of data point organization on 
neural nets and the generation of meta-features. Also, it would be interesting to examine 
other applications of propagation of qualitative rankings and measures. 

    Feature                Coefficient Magnitude
    HAI 3                  0.029
    HAI 6                  0.010
    READM 30 AMI           0.148
    READM 30 HF            0.139
    READM 30 HIP KNEE      0.148
    READM 30 HOSP WIDE     0.122
    READM 30 PN            0.074
    PSI 4 SURG COMP        0.018
    PSI 90 SAFETY          0.003
    MORT 30 HF             0.116
    MORT 30 PN             0.121
    MORT 30 AMI            0.072
    SPP                    0.053
    OP 10                  0.045
    OP 11                  0.080
    OP 13                  0.003
    AMI 10                 0.069
    HF 1                   0.058
    HF 3                   0.136
    STK 1                  0.136
    STK 5                  0.040
    STK 6                  0.156
    STK 8                  0.059
    STK 10                 0.083
    VTE 2                  0.183
    VTE 3                  0.126
    VTE 4                  0.002
    PN 6                   0.112
    SCIP CAR               0.074
    IMM 2                  0.239
    OP 6                   0.077
    OP 7                   0.009
    OP 21                  0.028
    PC 01                  0.108
    H CLEAN HSP SN         0.171
    H COMP 1 SN            0.141
    H COMP 2 SN            0.082
    H COMP 5 SN            0.119
    H COMP 6 Y             0.156
    H HSP RATING 7 8       0.185
    H RECMND DN            0.176
    H RECMND PY            0.123

Table 2: Significant features from Non Negative Least Squares with Quality Function as 
the Dependent Variable 